Method of selecting seeds for the clustering of key-frames

ABSTRACT

The method is characterized in that it implements the following steps: random drawing of p candidates from the set of key images, calculation of a cost C for each candidate, selection of the candidate minimizing the cost C, determination of a subset from among the set of key images such that the key images forming the said subset have a distance from the candidate less than a threshold T, determination of a seed from among the key images of the subset such that it minimizes the cost function C for this subset, deletion of the key images of the subset to form a new set of key images for at least one new random draw and determination of a new seed according to the previous 5 steps. The field is that of the selection of shots of interest in a video sequence.

The invention relates to a method of selecting seeds for the grouping of key images of a video sequence as well as to the method of grouping. It also relates to a method of automatic extraction of shots of interest in a video sequence.

The field is that of index construction for intra-video interactive navigation or that of video structuring, that is to say the ordering of the shots which makes it possible to define a table of contents.

A video sequence is composed of shots, each shot being delimited by picture breaks, which may themselves be grouped into scenes. Structuring involves a step of classifying the shots. The latter step is of course possible only on condition that the content of the video can be structured, for example in fields such as sport, televised news, interviews, etc.

Usually, the classes are defined beforehand by supervised or unsupervised learning procedures, then the candidate shots of a video are attached to one of the classes, on the basis of a similarity measure.

The classification of the shots is a fundamental step of the video structuring process. Numerous procedures for classifying and representing shots are proposed, but few concern themselves with the identification of these classes.

For example, in a video summary creation context, the article by S. Uchihashi, J. Foote, A. Girgensohn, J. Boreczky entitled “Video Manga: Generating Semantically Meaningful Video Summaries”, Proc. ACM Multimedia, Orlando, Fla., pp 383-392, November 1999, describes a hierarchical grouping procedure within the step of structuring the sequence. The result is represented in the form of a tree. On initialization, each image of the video is assigned to its own class or cluster. Then similar images are grouped together by iteratively merging the two closest structures at each step. At the root, one finds the maximum cluster containing the set of images. The desired number of clusters is then selected by specifying the distance of the merged clusters from their parent. By this procedure, similar shots are grouped together, but no information regarding the nature of the shots is found.

On the other hand, for the structuring of televised news, as described in the article by H. J. Zhang, S. Y. Tan, S. W. Smoliar, G. Yihong entitled “Automatic parsing and indexing of news video”, Multimedia Systems, 2(6):256-265, 1995, one seeks to distinguish two types of shots: those concerning the presenter and those relating to reporter footage. The shots of the presenter are identified with the aid of spatial characteristics: typically, a person in the foreground and an inlay in the top right or left. The first step consists in defining a model A of the image representative of a shot of the presenter. In the second step, the shots are labelled as belonging to A or otherwise, with the aid of a measure of similarity using local descriptors, the key image previously being segmented into regions. In this procedure, a shot of interest is modelled first, then all the shots which come close to this model are selected.

Another application of the selection of shots of interest is to identify the shots concerning the interviewee and those of the interviewer in a video of an interview. In this approach, described for example in the article by O. Javed, S. Khan, Z. Rasheed, M. Shah entitled “A Framework for Segmentation of Interview Videos”, IASTED Intl. Conf. Internet and Multimedia Systems and Applications, Las Vegas, November, 2000, one is more interested in the information carried by the transitions between shots, coupled with the knowledge of the structure of an interview video (alternate shots of the interviewer and of the interviewee), than in the analysis of the content of the scene. However, a skin detection algorithm is used to determine the number of people in the image. Since the questions are typically shorter than the answers, the assumption used is that the shots of the interviewer are among the shortest. The key images of the N shortest shots containing just one person are correlated to find the most repetitive shot. One thus obtains an N×N correlation matrix whose rows are summed. The key image corresponding to the maximum sum is then identified as the key image of the interviewer. It is again correlated with all the other images to find all the shots concerned therewith.

FIG. 1 represents, in a known manner, a general scheme of the construction of video summaries. In a first step referenced 1, the sequence of images is split into shots, the shots being delimited by picture breaks. For each shot, one or more characteristic images are selected, these being key images. This is the object of step 2. For each key image, a signature is calculated in step 3, using local descriptors or attributes, for example colour, contours, texture, etc. Step 4 performs a selection of shots of interest as a function of these signatures or attributes and a summary is made in step 5 on the basis of these shots of interest.

The nature of the shots of interest varies as a function of the intended application. For example, for televised news, it may involve the presenter. These shots of interest often correspond to the prevalent shots, that is to say to a dominant picture. Specifically, in certain sequences, in particular sports sequences, the most interesting moments are characterized by a common and repetitive picture in the course of the sequence, for example during a football, tennis, baseball match, etc.

The invention is more particularly related to the step of selecting the shots of interest. The procedure proposed is based on the signature of each key image, associated with a metric, so as to determine in a binary manner whether or not the shots belong to the class of shots of interest.

Numerous algorithms exist for partitioning, or “clustering”. Among the most widely used are the K-means, based on the calculation of the barycentre of the attributes, and its variant the K-medoid, which takes into account the physical point, that is to say the image closest to the barycentre. Both are iterative algorithms. From an initial partition, the K-means or K-medoid groups the data together into a fixed number of classes. This grouping is very sensitive to the initial partition. Moreover, it requires the a priori fixing of the number of classes, that is to say a priori knowledge of the content of the video. Without such knowledge, it does not guarantee the obtaining of an optimal partitioning of the video sequence processed.

An aim of the invention is to alleviate the aforesaid drawbacks. Its subject is a method of selecting seeds from a set of key images of a video sequence for the grouping of key images of prevalent shots of the video sequence, characterized in that it implements the following steps:

random drawing of p candidates from the set of key images, p being calculated in such a way as to obtain a very good probability of drawing a key image of a prevalent shot,

calculation of the cost C for each candidate, dependent on the distance from the key images of the set to that of the candidate, the distance relating to the signatures,

selection of the candidate (k1) minimizing the cost C,

determination of a subset (Ik) from among the set of key images such that the key images forming the said subset have a distance from the candidate less than a threshold T,

determination of a seed (k2) from among the key images of the subset (Ik) such that it minimizes the cost function C for this subset,

deletion of the key images of the subset (Ik) to form a new set of key images for at least one new random draw and determination of a new seed according to the previous 5 steps.

According to a particular implementation, the random draw is of the Monte-Carlo type, p being calculated by the Monte-Carlo formula.

According to a particular implementation, the key images are weighted, as regards their signature, as a function of the length of the shots of the video sequence that they characterize and the random draw is biased by the weight of the key images.

The invention also relates to a method of grouping (clustering) shots of a sequence of video images, the sequence being split into shots, a shot being represented by one or more key images, at least one signature or attribute being calculated for the key images, comprising a phase of partitioning the key images on the basis of a comparison of the attributes of the key images, characterized in that it comprises a phase of initialization for the selection of at least two key images or seeds on the basis of which the comparisons for the grouping are performed, the selection being performed according to the method of claim 1.

According to a particular implementation, the method is characterized in that the initialization and partitioning phases are iteratively repeated, the key images of the most compact cluster obtained in the previous iteration being eliminated from the set processed at this previous iteration so as to provide a new set on which the new iteration is performed.

The invention also relates to a method of selecting shots of interest, these shots being prevalent in the video sequence, characterized in that it implements the method described above, the shots of interest corresponding to the grouping performed about the first seed selected.

The larger the number n of seeds picked to initialize the clustering algorithm, the more compact and hence coherent are the clusters in the sense of the metric used. The number n of seeds picked to initialize the algorithm is fixed but the number of classes obtained is not known a priori.

The method allows the identification of the prevalent shots in the sequence. During the first iteration, the first cluster obtained represents these prevalent shots. The subsequent iterations make it possible to refine the classification and to reduce the rate of poor classifications. It is thus possible to retrieve the whole set of shots concerning a prevalent picture so as to eliminate the secondary sequences. Assuming that these prevalent shots are shots of interest, the method makes it possible to automatically extract the shots of interest from the video sequence.

Other features and advantages of the invention will become clearly apparent in the following description given by way of non-limiting example and offered in conjunction with the appended figures which represent:

FIG. 1, a general scheme of the construction of video summaries,

FIG. 2, an algorithm for selecting seeds,

FIG. 3, the result of the clustering on a tennis sequence.

The processing algorithm for partitioning the images operates in two successive phases. The first consists in selecting candidate shots for the grouping, this being the algorithm for selecting seeds. The object of the second phase is to assign each shot to one of the groups represented by the seeds, this being the actual algorithm for classifying the shots. The phase of selecting the seeds is based on the assumption of prevalent shots and ensures that during the first iteration of the partitioning algorithm, a shot corresponding to the picture of interest is selected first. Shots belonging to the so-called class of interest are then labelled “shots of interest”, the others “shots of non-interest”.

The first phase consists in a selection of the representatives of the “interest”/“non-interest” classes.

The assumption on which the selection is based is that the shots belonging to the picture that we are seeking are prevalent in terms of number of images in the set of the sequence. We make the assumption that at least half of the key images representing our shots do indeed correspond to the model sought.

In order to give more significance to lengthy shots and to satisfy the assumption, a coefficient representative of the significance of the shot, in terms of relative length, is attached to each key image, to give greater weight to the prevalent shots. This weighting coefficient is taken into account in the subsequent steps, in particular in the calculations of distance. It would of course be equally conceivable to attach several key images to a shot in proportion to the latter's length. In this case, on account of the more significant volume of data to be processed, the processing time would be increased.
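By way of illustration, the weighting coefficient may be sketched as follows. This is a minimal sketch which assumes that the coefficient is simply the length of the shot divided by the total length of the sequence. The function name shot_weights and this normalization are assumptions and are not specified in the description.

```python
def shot_weights(shot_lengths):
    """Relative-length weight of each shot (assumed form: length / total length).

    shot_lengths: number of images in each shot, one entry per key image.
    """
    total = sum(shot_lengths)
    return [length / total for length in shot_lengths]
```

With such a normalization the weights sum to 1, so a shot covering half of the sequence carries a weight of 0.5.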

Initialization of the Algorithm

The step of initialization of the partitioning (clustering) algorithm, which itself constitutes the algorithm for selecting seeds, consists in finding seeds for the classification in the space of signatures. The number of images being large, it is carried out by random drawing of p key images. In order to ensure, under the assumption of prevalent shots of interest, that at least one key image representative of a prevalent shot is drawn, the number p is calculated according to the Monte-Carlo sampling procedure. In this formula, the data contamination rate is biased by the weight of the key images. The Monte-Carlo sampling procedure is known and described for example in the article by P. Meer, D. Mintz, A. Rosenfeld entitled: “Robust regression methods for computer vision: a review”, International Journal of Computer Vision, Vol. 6, No. 1, pp. 59-70 (1991). It is necessary to ensure also that the same draw is never performed twice. Only one key image out of the p images drawn will be picked as seed for the initialization of the clustering algorithm, as indicated hereinbelow.

FIG. 2 describes an algorithm for selecting n seeds. The various rectangles referenced A to F represent the set of key images such as it evolves during the processing.

Step 6 groups together the set of candidate key images in the set of signatures. At the outset, that is to say during the first iteration, this is the set of key images of the video sequence processed. This set is represented by the rectangle A. These images are characterized by their signatures, for example the dominant colours which are the components of a multidimensional vector allocated to each image.

The next step 7 performs a random drawing of a candidate according to a Monte-Carlo type numerical sampling procedure. The next step 8 calculates, for the image drawn, its cost.

For example, this cost may be defined by the function:

$$C = \sum_i f(e_i^2)$$

with:

$e_i$ = weighted quadratic distance between the signature picked, that is to say that of the image selected or candidate, and the signature of image i of the set,

$$f(e_i^2) = \begin{cases} e_i^2 & \text{if } e_i^2 < T^2 \\ T^2 & \text{if } e_i^2 \geq T^2 \end{cases}$$

where T is the standard deviation of the distribution of the weighted distances from the image selected.
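The cost of step 8 may be sketched as follows. This is a minimal sketch under stated assumptions: a signature is taken to be a tuple of numbers, the weighted distance is assumed to be the weight of image i multiplied by the Euclidean distance between the signatures, and T is taken as the population standard deviation about the mean. The description does not fix these choices, and the function names are hypothetical.

```python
import math
import statistics


def weighted_distances(candidate, signatures, weights):
    """e_i for every key image i (assumed form: weight_i * Euclidean distance)."""
    return [w * math.dist(candidate, s) for s, w in zip(signatures, weights)]


def truncated_cost(distances):
    """Return (C, T) with C = sum of f(e_i^2), where f saturates at T^2.

    T is the standard deviation of the distances from the candidate.
    """
    t = statistics.pstdev(distances)
    cost = sum(min(e * e, t * t) for e in distances)
    return cost, t
```

Because each term is capped at T², key images far from the candidate all contribute the same amount, so the cost is driven mainly by how tightly the nearby key images are grouped around the candidate.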

Steps 7 and 8 are repeated p times, p being a value calculated by the Monte-Carlo formula. The value p thus corresponds to the minimum number of draws making it possible to guarantee with a high probability that a key image representing a prevalent shot has been drawn. This probability depends on the rate of contamination, that is to say on the percentage of key images of interest in the set. For example p is of the order of 10 for a probability of 99% and a minimum contamination rate of 50%. Of course, a random draw according to another procedure may be performed in step 7, the number p being related to this probability that at least one key image representing a prevalent shot has been drawn. One therefore obtains p candidates, elements represented in black in the rectangle B. Out of the p candidates to which p costs are allocated, a selection is performed which consists in choosing the candidate k1 corresponding to the lowest cost; this is the object of step 9. Given the assumptions, this candidate corresponds to a key image of a prevalent shot. This element is designated by an arrow in the rectangle C. Step 10 carries out a calculation of the standard deviation T of the distribution with respect to the candidate k1.
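The order of magnitude of p may be checked with a short calculation. The description refers only to "the Monte-Carlo formula"; the sketch below assumes the form commonly used for random sampling, P = 1 − (1 − (1 − ε)^q)^p, where P is the desired probability, ε the contamination rate and q the number of elements per draw (q = 1 here, since one key image is drawn at a time). The function name and parameter names are hypothetical.

```python
import math


def number_of_draws(success_probability, contamination_rate, sample_size=1):
    """Smallest p such that 1 - (1 - (1 - eps)**q)**p >= success_probability.

    Assumes 0 < contamination_rate < 1 and 0 < success_probability < 1.
    """
    good = (1.0 - contamination_rate) ** sample_size  # chance that one draw is "clean"
    return math.ceil(math.log(1.0 - success_probability) / math.log(1.0 - good))
```

Under these assumptions, number_of_draws(0.99, 0.5) gives 7, which is consistent with p being of the order of 10.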

The next step 11 determines, for the draw or candidate k1 picked, the subset Ik of the elements whose signatures lie at a distance from the candidate less than a threshold T. Here this threshold is the standard deviation of the distribution of distances from the candidate, but it could equally well be a value fixed a priori. This subset is outlined in the rectangle D.

The determination of the seed k2, in the subset Ik, to initialize the K-medoid, is performed in the next step 13. This is the element of the subset Ik minimizing the cost function C. It is a local minimum. Step 14 stores this seed k2. This seed is designated by an arrow in the subset Ik represented in the rectangle F.

The iteration is performed by looping back from step 11 to step 6 by way of a step 12. After determination of the subset Ik, step 11, step 12 eliminates from the set of key images the elements making up this subset. The set of candidate key images is therefore restricted by discarding the elements of the subset Ik, which contains key images that are too close to the seed previously found. The elements of the new set, represented in the rectangle E, are grouped together in step 6 and utilized for a new iteration. The number of iterations, that is to say of seeds selected, is fixed at n, n being a predefined value. Step 14 therefore stores n seeds k2.
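Steps 6 to 14 of FIG. 2 may be sketched as a single loop. This is a minimal sketch and not a definitive implementation: key images are referred to by their index, the weighted distance is assumed to be the weight of image i multiplied by the Euclidean distance, the seed k2 is assumed to be evaluated with the threshold T computed for k1, and k1 is always kept in its own subset so that each iteration removes at least one key image. The function and parameter names are hypothetical.

```python
import math
import random
import statistics


def select_seeds(signatures, weights, n_seeds, p_draws, rng=None):
    """Return the indices of up to n_seeds seeds (the k2 of each iteration)."""
    rng = rng or random.Random()
    remaining = list(range(len(signatures)))  # step 6: current set of key images
    seeds = []

    def distances(center, members):
        # Assumed weighted distance: weight of image i times Euclidean distance.
        return [weights[i] * math.dist(signatures[center], signatures[i])
                for i in members]

    def cost(center, members, threshold=None):
        d = distances(center, members)
        t = statistics.pstdev(d) if threshold is None else threshold
        return sum(min(e * e, t * t) for e in d)

    while remaining and len(seeds) < n_seeds:
        # Step 7: p weight-biased draws, never drawing the same image twice.
        pool = list(remaining)
        candidates = []
        for _ in range(min(p_draws, len(pool))):
            pick = rng.choices(pool, weights=[weights[i] for i in pool])[0]
            pool.remove(pick)
            candidates.append(pick)

        # Steps 8 and 9: keep the candidate k1 of lowest cost.
        k1 = min(candidates, key=lambda c: cost(c, remaining))

        # Steps 10 and 11: threshold T and subset Ik around k1.
        d_k1 = distances(k1, remaining)
        t = statistics.pstdev(d_k1)
        subset = [i for i, e in zip(remaining, d_k1) if i == k1 or e < t]

        # Steps 13 and 14: the seed k2 minimizes the cost within Ik.
        k2 = min(subset, key=lambda c: cost(c, subset, threshold=t))
        seeds.append(k2)

        # Step 12: remove Ik before the next iteration.
        discarded = set(subset)
        remaining = [i for i in remaining if i not in discarded]

    return seeds
```

The loop ends early when no key image remains, so fewer than n seeds may be returned on a small or very homogeneous set.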

On account of the assumption, the shots sought being prevalent, the weights of each of the images which represent them are among the most significant. This guarantees that they correspond to the most compact group in the sense of the metric used. We are then certain that the first seed sought corresponds to a representative of the “interest” class.

The second phase consists in the implementation of the algorithm for partitioning the shots.

The partition or grouping of the shots on the basis of the seeds is performed in a conventional manner by grouping the key images of signatures that are closest to those of the seeds. To each seed found there corresponds a group. Each shot represented by its signature is attributed to the closest seed. If a signature is equidistant from two seeds or too distant from the seeds, it is picked as a new seed and the partitioning of the shots recommences taking account of this new seed.
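The attribution of the shots to the seeds, including the promotion of a signature to a new seed, may be sketched as follows. This is a minimal sketch: the description does not say how "too distant" is measured, so the parameter max_distance is a hypothetical threshold, the plain Euclidean distance between signatures is assumed, and the function name is hypothetical.

```python
import math


def assign_to_seeds(signatures, seeds, max_distance):
    """Attribute each key image to its closest seed.

    Returns (seeds, labels), where labels[i] is the seed index of image i.
    A key image that is equidistant from its two closest seeds, or farther than
    max_distance from every seed, becomes a new seed and the pass restarts.
    """
    seeds = list(seeds)
    while True:
        labels = []
        promoted = None
        for i, signature in enumerate(signatures):
            ranked = sorted((math.dist(signature, signatures[k]), k) for k in seeds)
            nearest_distance, nearest_seed = ranked[0]
            tie = len(ranked) > 1 and math.isclose(ranked[1][0], nearest_distance)
            if i not in seeds and (tie or nearest_distance > max_distance):
                promoted = i
                break
            labels.append(nearest_seed)
        if promoted is None:
            return seeds, labels
        seeds.append(promoted)  # restart the partitioning with the new seed
```

Each restart adds one seed, so the loop terminates after at most as many passes as there are key images.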

The shots or key images of the cluster corresponding to the first seed are labelled “interest”, the other shots are labelled “non-interest”.

This procedure is not robust in the sense that the classification is not optimal, since a shot is compelled to be associated with one of the clusters predetermined by the initialization except in one of the two cases cited above. A particular implementation of the invention consists in carrying out an iteration of the partitioning algorithm making it possible to render the procedure robust.

Once all the shots have been grouped into clusters, the mean and the standard deviation of the distribution of the distances from the seed are calculated for each cluster obtained. Only the most compact cluster is picked. The other clusters are “released”, that is to say a new set consisting of these other clusters alone is utilized for the subsequent implementation of the initialization and classification algorithms. The initialization and classification processes are therefore repeated for all the remaining key images. An iteration is therefore performed on the basis of a set obtained by eliminating from the set corresponding to the previous iteration, the most compact cluster found during this previous iteration.
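The retention of the most compact cluster and the release of the others may be sketched as follows. This is a minimal sketch: the description calculates the mean and the standard deviation of the distances from the seed but does not say how they are combined, so compactness is assumed here to be the smallest mean distance, with the standard deviation used only to break ties. The function name is hypothetical.

```python
import math
import statistics


def split_most_compact(signatures, labels):
    """Return (seed, members, released) for the most compact cluster.

    labels[i] is the seed index to which key image i was attributed.
    released lists the key images of all the other clusters.
    """
    clusters = {}
    for i, seed in enumerate(labels):
        clusters.setdefault(seed, []).append(i)

    def spread(seed):
        d = [math.dist(signatures[seed], signatures[i]) for i in clusters[seed]]
        return statistics.fmean(d), statistics.pstdev(d)

    best = min(clusters, key=spread)
    released = [i for seed, members in clusters.items()
                if seed != best for i in members]
    return best, clusters[best], released
```

The released key images would then be passed again to the initialization and classification algorithms. Under the assumed criterion a cluster of a single image has zero spread and is trivially the most compact.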

Several criteria for stopping the iterations may be implemented:

a single seed is selected, that is to say the initialization process generates just one seed. The candidate key images processed by the initialization algorithm are sufficiently close together to correspond to the subset Ik and hence to a single seed;

the averages of the intra-cluster distances are almost equal. Stated otherwise, a new iteration of the partitioning algorithm will not afford any extra information;

there no longer remain sufficient key images and unit clusters would be obtained. The candidate key images are very different, generating only subsets Ik of a single image.
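The three stopping criteria above may be gathered in a single test. This is a minimal sketch: the description does not quantify "almost equal", so the relative tolerance rel_tol is a hypothetical parameter, as are the function name and its arguments.

```python
def should_stop(seeds, cluster_mean_distances, cluster_sizes, rel_tol=0.05):
    """Return True if one of the three stopping criteria is met."""
    # Criterion 1: the initialization generated a single seed.
    if len(seeds) <= 1:
        return True
    # Criterion 3: only unit clusters would be obtained.
    if all(size == 1 for size in cluster_sizes):
        return True
    # Criterion 2: the averages of the intra-cluster distances are almost equal.
    largest = max(cluster_mean_distances)
    smallest = min(cluster_mean_distances)
    return largest - smallest <= rel_tol * largest
```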

The algorithm has been deployed and tested on several tennis sequences.

FIG. 3 shows the result of the clustering algorithm on a tennis sequence containing 97 shots, with n=3.

Cluster No. 1, at the top of the figure, represents the shots of interest, here the pictures of the tennis court. The other clusters contain close-ups and pictures of the public.

Applications relate for example to the creation of a lengthy summary by concatenating the shots selected. The shots selected may also serve as input to a more complex structuring algorithm that may be based on a priori knowledge of the picture.

In the example described, the signature used is a dominant colours vector associated with a simplified quadratic distance. Other criteria making it possible to characterize the video sequences to be partitioned, for example texture, contours, etc. may be contemplated. These criteria are chosen in such a way as to be able to characterize shots of interest.

1. Method of selecting seeds from a set of key images of a video sequence for the grouping of key images of prevalent shots of the video sequence, comprising the following steps: random drawing of p candidates from the set of key images, p being calculated in such a way as to obtain a very good probability of drawing a key image of a prevalent shot, calculation of the cost C for each candidate, dependent on the distance from the key images of the set to that of the candidate, the distance relating to the signatures, selection of the candidate (k1) minimizing the cost C, determination of a subset (Ik) from among the set of key images such that the key images forming the said subset have a distance from the candidate less than a threshold T, determination of a seed (k2) from among the key images of the subset (Ik) such that it minimizes the cost function C for this subset, deletion of the key images of the subset (Ik) to form a new set of key images for at least one new random draw and determination of a new seed according to the previous 5 steps.
2. Method according to claim 1, wherein the random draw is of the Monte-Carlo type, p being calculated by the Monte-Carlo formula.
3. Method according to claim 1, wherein the key images are weighted, as regards their signature, as a function of the length of the shots of the video sequence that they characterize and in that the random draw is biased by the weight of the key images.
4. Method according to claim 1, wherein the cost C is dependent on the quadratic distances between the signature of the candidate and those of the key images of the subset and in that T is the standard deviation of the distribution of the distances of the key images of the set from the candidate.
5. Method according to claim 1, wherein the signature of an image relates to the dominant colour.
6. Method of grouping (clustering) shots of a sequence of video images, the sequence being split into shots, a shot being represented by one or more key images, at least one signature or attribute being calculated for the key images, comprising a phase of partitioning the key images on the basis of a comparison of the attributes of the key images, comprising a phase of initialization for the selection of at least two key images or seeds on the basis of which the comparisons for the grouping are performed, the selection being performed according to the method of claim 1.
7. Method according to claim 6, wherein the partitioning phase implements an algorithm of the K-means or K-medoid type.
8. Method according to claim 6, wherein the initialization and partitioning phases are iteratively repeated, the key images of the most compact cluster obtained in the previous iteration being eliminated from the set processed at this previous iteration so as to provide a new set on which the new iteration is performed.
9. Method according to claim 8, wherein the stopping criterion for the iterations is dependent on the number of key images not belonging to the most compact cluster selected or else is dependent on the averages of the intra-cluster distances.
10. Method of selecting shots of interest, these shots being prevalent in the video sequence, implementing the method according to claim 6, the shots of interest corresponding to the grouping performed about the first seed selected.